Abstract. A new stream of research was born in the last decade with
the goal of mining itemsets of interest using Constraint Programming
(CP). This has promoted a natural way to combine complex constraints
in a highly flexible manner. Although CP state-of-the-art solutions
formulate the task using Boolean variables, the few attempts to adopt
propositional Satisfiability (SAT) provided an unsatisfactory performance.
This work deepens the study on when and how to use SAT for the
frequent itemset mining (FIM) problem by defining different encodings
with multiple task-driven enumeration options and search strategies.
Although for the majority of the scenarios SAT-based solutions appear to
be non-competitive with CP peers, results show a variety of interesting
cases where SAT encodings are the best option.



1 Introduction 

Recent works [34, 24, 17, 9] show the cross-fertilization between Pattern Mining
(PM) tasks, the discovery of patterns within large datasets, and Constraint
Programming (CP), the programming paradigm wherein relations between variables
are stated declaratively in the form of constraints. Traditional greedy approaches
for PM contrast with optimal approaches developed within the artificial
intelligence community. While traditional research aims at developing highly
optimized and scalable implementations that are tailored towards specific tasks,
CP employs a generic and declarative approach to model and mine patterns. This has
motivated the adoption of high-level modeling languages or general solvers (that
specify what the problem is, rather than outlining how a solution should be
computed) for the flexible definition of constraints, which are critical for many
PM applications and domains [22, 24, 21].

The core underlying task for every PM task is to count. Counting is required
for every constraint: a specific pattern shape is only of interest above a minimum
support threshold. However, here resides the efficiency bottleneck of CP solvers,
which need to deal with large counting options to solve frequency-based
inequalities. Even though state-of-the-art CP-based solutions are not yet as
scalable as traditional PM solutions [34], they can be used for local scans [32]
and for the expressive definition of user-driven and non-trivial constraints [30],
and their optimal search nature has led to significant performance improvements
in a wide diversity of problems [22].

In this work, we propose to study whether a specific class of CP solvers is
tailored to solve this core task. The target class of solvers, propositional
Satisfiability (SAT) solvers, aims to solve the Boolean SAT problem, which is the
problem of finding an assignment for a set of Boolean variables that evaluates
a target formula (usually restricted to conjunctive normal form) to true.
Although Boolean encodings are commonly adopted by CP solvers for PM tasks
[17], to the best of our knowledge SAT-based solutions have only been proposed
for a particular subclass of PM problems aiming at discovering a fixed number
of patterns (k-PM) [30, 21].

Based on the critical need for an efficient CP solver for PM tasks, this work
undertakes an extensive review to understand how SAT solutions compare to
state-of-the-art CP-based alternatives. The target research question is: how does
SAT behave in comparison to more general CP frameworks for PM underlying tasks?
In Section 2 the problem is defined and motivated. Section 3 introduces
different SAT encodings with multiple enumeration and search options. Section 4
describes the properties of the adopted implementations and conducts an
experimental analysis. The results are discussed and their implications synthesized.
Section 5 reviews related research with relevant contributions to the target
problem. Finally, concluding remarks and potential prospective research directions
are presented.

2 Problem Definition 

Definition 1. Let $\mathcal{I}$ be a finite set, called the set of items, let
$\mathcal{T}$ be a finite set, called the set of transactions, and let $I$ be an
itemset, $I \subseteq \mathcal{I}$. A transaction $t \in \mathcal{T}$ over
$\mathcal{I}$ is a pair $(tid, I)$, with $tid$ an identifier and
$I \subseteq \mathcal{I}$.

Definition 2. An itemset database $D$ over $\mathcal{I}$ is a finite set of
transactions.

In a simplified way, a transactional dataset is a multi-set of itemsets (being the
language of itemsets $L_{\mathcal{I}} = 2^{\mathcal{I}}$). Equivalently, an itemset
database $D$ can be seen as a binary matrix of size $m \times n$, where
$m = |\mathcal{T}|$ and $n = |\mathcal{I}|$, with $D_{ti} \in \{0,1\}$, such that:

$D = \{(t, I) \mid t \in \mathcal{T},\ I \subseteq \mathcal{I},\ \forall i \in I : D_{ti} = 1\}$ (1)
Table 1: Illustrative itemset database: compact and Boolean views

A small example of an itemset database is given in Table 1. A traditional
example of an itemset database is supermarket shopping, where each
transaction corresponds to a purchased basket and every item to a bought product.
However, common attribute-value tables can be easily converted into an itemset
database. Both categorical data (where every attribute-value pair corresponds
to an item) and numeric data (following an expressive discretization technique
[32]) can be converted, with each row being mapped into a transaction.

Definition 3. The coverage $\varphi_D(I)$ of an itemset $I$ is the set of all
transactions in which the itemset occurs:

$\varphi_D(I) = \{t \in \mathcal{T} \mid \forall i \in I : D_{ti} = 1\}$ (2)

Definition 4. The support of an itemset $I$, denoted $sup_D(I)$, is its coverage
size, $|\varphi_D(I)|$; the frequency of an itemset, denoted $freq_D(I)$, is
$sup_D(I)/|D|$.

Considering the itemset database from Table 1, we have
$\varphi_D(\{J, N\}) = \{t_4, t_5\}$, $sup_D(\{J, N\}) = |\{t_4, t_5\}| = 2$ and
$freq_D(\{J, N\}) = 0.3(3)$. Now the target PM problem can be formulated.

Definition 5. Given an itemset database $D$ and a minimum support threshold
$\theta$, the frequent itemset mining problem consists of computing the set:

$\{I \mid I \subseteq \mathcal{I},\ sup_D(I) \geq \theta\}$ (3)

Definition 6. Let a frequent itemset be an itemset with $sup_D(I) \geq \theta$;
a pattern is a frequent itemset that satisfies any other constraints placed
over $D$.

Considering the database from Table 1 and fixing $\theta = 3$, {D, H, J} and {E, N}
are examples of frequent itemsets. Finding frequent itemsets is the core
underlying task of every pattern discovery problem and, additionally, forms the
basis for association-rule analysis, classification, regression, and clustering.
FIM was initially proposed in 1993 by Agrawal et al. [1].
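
Definitions 3 to 5 can be illustrated with a minimal brute-force sketch. The database below is a hypothetical toy example (it is not Table 1), the function names are illustrative, and the empty itemset is skipped as a simplifying assumption. The enumeration is exponential in the number of items, so it only serves to make the definitions concrete.

```python
from itertools import combinations

# Hypothetical toy database (NOT Table 1): transaction id -> set of items.
DB = {
    "t1": {"A", "B", "C"},
    "t2": {"A", "C"},
    "t3": {"A", "D"},
    "t4": {"B", "C"},
}

def coverage(db, itemset):
    """Transactions that contain every item of `itemset` (Definition 3)."""
    return {tid for tid, items in db.items() if itemset <= items}

def support(db, itemset):
    """Size of the coverage (Definition 4)."""
    return len(coverage(db, itemset))

def frequency(db, itemset):
    """Support divided by the number of transactions (Definition 4)."""
    return support(db, itemset) / len(db)

def frequent_itemsets(db, theta):
    """All non-empty itemsets with support >= theta (Definition 5), by brute force."""
    alphabet = sorted(set().union(*db.values()))
    result = []
    for k in range(1, len(alphabet) + 1):
        for combo in combinations(alphabet, k):
            if support(db, set(combo)) >= theta:
                result.append(frozenset(combo))
    return result

if __name__ == "__main__":
    print(coverage(DB, {"A", "C"}))
    print(frequent_itemsets(DB, 2))
```

With a threshold of 2, the toy database yields the frequent itemsets {A}, {B}, {C}, {A, C} and {B, C}.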

2.1 CP Mapping 

Flexible constraint-based methods are key for PM as they:

- Focus on what the problem is, rather than on how a solution should be
computed. This is powerful enough to be used across a wide variety of applications
and domains [22, 24, 21], as it removes the need to adapt the underlying
traditional procedures in order to accommodate new types of constraints.

- Provide an easy method to adapt the search by changing the declarative
specification to combine and add new constraints. This supports not only
the user-driven selection of which patterns are of interest, but also FIM-based
methods that may require iterative refinements, such as constraint-driven
clustering and pattern-based classification.

- Can expressively capture background knowledge to prune the explosion of
spurious and potentially non-interesting patterns [23]. These constraints may
include properties from both closed pattern mining [10] and domain-driven
pattern mining [3], which aim to incrementally improve results by refining
the way patterns and domain knowledge are represented.

- Support the introduction of a wide variety of expressive constraints, such as
pattern-set constraints (e.g. global patterns imposing overlapping relaxations
over local patterns), with key implications for a wide variety of problems
ranging from web mining to bioinformatics [9, 17].

Although CP models like ConQueSt [10] or MusicDFS [37] already support a
predefined number of constraints, they do not allow for the expressive definition
and combination of constraints offered by CP approaches like FIMCP [17],
PattCP [21], or GeMini [30]. The selected mapping is based on the constraint
encodings of the latter approaches. In these CP models for the FIM task, a Boolean
variable is used for every individual item, $I_i$, and for every transaction,
$T_t$. One assignment of values to all $I_i$ and $T_t$ corresponds to one itemset
and its corresponding transaction set.

Definition 7. Let $T$ be a transaction set, $T \subseteq \mathcal{T}$. An itemset
$I$ can be defined by the true item variables: $I_i = 1$ if $i \in I$ and
$I_i = 0$ if $i \notin I$. A transaction set $T$ can be defined by the set of
transactions that are covered by the itemset, $T = \varphi_D(I)$. Thus,
$T_t = 1$ if $t \in \varphi_D(I)$.

Corollary 1. The FIM task can now be viewed as the computation of valid and
frequent $(I, T)$ combinations, i.e. as finding the set:

$\{(I, T) \mid I \subseteq \mathcal{I},\ T \subseteq \mathcal{T},\ T = \varphi_D(I),\ |T| \geq \theta\}$ (4)

We refer to $T = \varphi_D(I)$ as the coverage condition, while $|T| \geq \theta$
expresses a support condition. These conditions restrict the valid variable
assignments. Note that, given that neither $I$ nor $T$ is fixed, there can be an
arbitrarily high number of valid attributions to $I_i$ and $T_t$ resulting in
different $(I, T)$ tuples that satisfy both constraints.

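Property 1 (Coverage Constraint). Given a database $D$, an itemset $I$ and a
transaction set $T$, then

$T = \varphi_D(I) \iff \left(\forall t \in \mathcal{T} : T_t = 1 \leftrightarrow \sum_{i \in \mathcal{I}} I_i (1 - D_{ti}) = 0\right)$ (5)

where $I_i \in \{0, 1\}$, $T_t \in \{0, 1\}$, and $I_i = 1$ if $i \in I$ and
$T_t = 1$ if $t \in T$ [34].

Property 2 (Frequency Constraint). Given a database $D$, a transaction set $T$
and a threshold $\theta$, then

$|T| \geq \theta \iff \sum_{t \in \mathcal{T}} T_t \geq \theta$ (6)

where $T_t \in \{0, 1\}$ and $T_t = 1$ if $t \in T$ [34].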

We can now model the frequent itemset mining problem as a combination
of the coverage constraint and the frequency constraint. To illustrate this,
[17] provides an example of a potential implementation in Essence [20] (a
solver-independent modeling language):

    given Freq : int, TDB : matrix[int(1..NrT),int(1..NrI)] of int(0..1)
    find I : matrix[int(1..NrI)] of bool, T : matrix[int(1..NrT)] of bool
    such that forall t: int(1..NrT)
      T[t] <=> ((sum i: int(1..NrI).(1-TDB[t,i])*I[i]) = 0)   (Coverage Constraint)
      (sum t: int(1..NrT).T[t]) >= Freq                        (Frequency Constraint)

3 Mapping

Since current state-of-the-art CP formulations for FIM rely on Boolean variables,
it is important to understand the impact of adopting approaches dedicated to
solving Boolean formulae. In particular, efficient and scalable SAT solvers
developed over the last decades have contributed to dramatic advances in the
ability to solve problem instances involving thousands of variables and millions
of clauses [30]. Additionally, a potential mapping of the previous constraints into
a conjunctive normal formula contains many binary clauses, which can be handled in
a very effective manner by SAT solvers. For these reasons, this work encodes FIM
as a SAT formula. Enumeration, search, encoding alternatives, and tuning options
are covered.

3.1 Core Encoding 

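The previously introduced coverage and frequency constraints map the FIM
problem into a high-level CP language. Next, an extended SAT encoding is
proposed. SAT clauses and pseudo-Boolean constraints will be interchangeably
adopted to facilitate their traceability.

Corollary 2 (Coverage Encoding). Given a database $D$, an itemset $I$ and a
transaction set $T$, the SAT encoding for the coverage constraint is:

$\bigwedge_{t \in \mathcal{T}} \left( \bigwedge_{i \in \mathcal{I} \mid D_{ti}=0} (\neg T_t \vee \neg I_i) \;\wedge\; \left(T_t \vee \bigvee_{i \in \mathcal{I} \mid D_{ti}=0} I_i\right) \right)$ (7)

Proof. This formula is derived from equation (5) by:

1. rewriting the coverage sum:
   $\sum_{i \in \mathcal{I}} I_i (1 - D_{ti}) = 0 \iff \bigwedge_{i \in \mathcal{I} \mid D_{ti}=0} \neg I_i$

2. decomposing the equivalence into CNF:
   $T_t \leftrightarrow \bigwedge_{i \in \mathcal{I} \mid D_{ti}=0} \neg I_i$
   $\iff \bigwedge_{i \in \mathcal{I} \mid D_{ti}=0} (\neg T_t \vee \neg I_i) \wedge \left(T_t \vee \bigvee_{i \in \mathcal{I} \mid D_{ti}=0} I_i\right)$

3. encoding the quantifier $\forall t \in \mathcal{T}$ into $m$ sets of clauses:
   $\iff \bigwedge_{t \in \mathcal{T}} \left( \bigwedge_{i \in \mathcal{I} \mid D_{ti}=0} (\neg T_t \vee \neg I_i) \wedge \left(T_t \vee \bigvee_{i \in \mathcal{I} \mid D_{ti}=0} I_i\right) \right)$

Complexity. Considering $n$ the size of $\mathcal{I}$ and $m$ the number of
transactions in $D$, we have the following properties: an upper bound of
$m \times n$ binary clauses, and $m$ clauses with a maximum of $n + 1$ literals.

This SAT formula guarantees the consistency of the $T_t$ and $I_i$ attributions.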
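
The coverage encoding can be sketched as a clause generator. The snippet below is a minimal illustration, assuming a DIMACS-style numbering that is not part of the original text: item variable $I_i$ is numbered `i + 1` and transaction variable $T_t$ is numbered `n + t + 1`, with a negative integer denoting a negated literal. The helper `transaction_models` is hypothetical and checks by brute force which $T_t$ assignments are compatible with a fixed itemset.

```python
from itertools import product

def coverage_clauses(D):
    """CNF clauses of equation (7) for a 0/1 matrix D (rows = transactions)."""
    m, n = len(D), len(D[0])
    clauses = []
    for t in range(m):
        t_var = n + t + 1
        # Items that transaction t does NOT contain (D[t][i] == 0).
        missing = [i + 1 for i in range(n) if D[t][i] == 0]
        # Binary clauses: (not T_t or not I_i) for every missing item.
        clauses.extend([-t_var, -i_var] for i_var in missing)
        # Long clause: (T_t or I_i1 or I_i2 ...) over the missing items.
        clauses.append([t_var] + missing)
    return clauses

def satisfies(clauses, assignment):
    """True if every clause has at least one literal made true by `assignment`."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def transaction_models(D, item_bits):
    """All T assignments consistent with the clauses for a fixed itemset."""
    m, n = len(D), len(D[0])
    clauses = coverage_clauses(D)
    found = []
    for t_bits in product([False, True], repeat=m):
        assignment = {i + 1: bool(item_bits[i]) for i in range(n)}
        assignment.update({n + t + 1: t_bits[t] for t in range(m)})
        if satisfies(clauses, assignment):
            found.append(t_bits)
    return found

if __name__ == "__main__":
    D = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
    print(coverage_clauses(D))
    print(transaction_models(D, (1, 0, 0)))
```

For a fixed itemset the clauses leave exactly one admissible assignment of the transaction variables, namely the coverage of that itemset, which is the consistency property stated above.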

To encode the frequency constraints, we need to extend the SAT notation to
include pseudo-Boolean (PB) constraints, which are extensions of SAT clauses
that support cardinality constraints and weighted literals. Additionally, we need
to adapt equation (6) into a reified frequency constraint for a more focused search
of the space, as discussed in [22]. This model is equivalent to the original model.

Property 3 (Reified Frequency Constraint). Given a database $D$, a transaction
set $T$ and a threshold $\theta$, then:

$|T| \geq \theta \iff \left(\forall i \in \mathcal{I} : I_i = 1 \rightarrow \sum_{t \in \mathcal{T}} T_t D_{ti} \geq \theta\right)$ (8)

Corollary 3 (Frequency Encoding). Given a database $D$, a transaction set
$T$ and a threshold $\theta$, then the PB encoding for the frequency constraint is:

$\bigwedge_{i \in \mathcal{I}} \left(\theta\, \neg I_i + \sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta\right)$ (9)

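Proof. This formula is derived from equation (8) by:

1. rewriting the frequency inequality into a cardinality constraint:
   $\sum_{t \in \mathcal{T}} T_t D_{ti} \geq \theta$
   $\iff \sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta$

2. decomposing the implication:
   $I_i = 1 \rightarrow \sum_{t \in \mathcal{T}} T_t D_{ti} \geq \theta$
   $\iff \neg I_i = 1 \vee \left(\sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta\right)$
   $\iff \neg I_i \vee \left(\sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta\right)$

3. mapping the previous formula into conjunctive normal form or, as below,
   into pseudo-Boolean constraints:
   $\neg I_i \vee \left(\sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta\right)$
   $\iff (\neg I_i \geq 1) \vee \left(\sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta\right)$
   $\iff \theta\, \neg I_i + \sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta$

   Potential SAT encodings for the
   $\neg I_i \vee (\sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta)$ constraint
   include the use of sequential counters [36], binary decision diagrams [19],
   sorting networks, or cardinality networks [4]. Note, however, that some of
   these encodings are not polynomial. Since solving these constraints is the
   core task of our problem, we benefit from solvers oriented to solve them.
   Moreover, a possible encoding using cardinality constraints would suffer from
   an additional complexity, since we would need to translate
   $\neg I_i \vee CardConstraint$. For these reasons, a pseudo-Boolean
   representation is the natural choice;

4. translating the quantifier $\forall i \in \mathcal{I}$ into $n$ sets of clauses:
   $\forall i \in \mathcal{I} : \left(I_i = 1 \rightarrow \sum_{t \in \mathcal{T}} T_t D_{ti} \geq \theta\right)$
   $\iff \bigwedge_{i \in \mathcal{I}} \left(\theta\, \neg I_i + \sum_{t \in \mathcal{T} \mid D_{ti}=1} T_t \geq \theta\right)$

Complexity. The incremental complexity added by this formula is $n$
pseudo-Boolean constraints (of the form $\geq$) with a maximum of $m$ unweighted
literals and one weighted literal.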

Concluding, equations (7) and (9) describe the resulting SAT encoding, which
has: i) $m + n$ variables; ii) $O(mn)$ clauses, with $mn$ being binary clauses and
$m$ clauses having $O(n)$ variables; and iii) $n$ pseudo-Boolean constraints with
$O(m)$ variables.
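
The frequency encoding can be checked against the reified constraint by brute force. The sketch below is only illustrative: the representation of a pseudo-Boolean constraint as a pair `(terms, bound)`, the variable numbering (item $I_i$ as `i + 1`, transaction $T_t$ as `n + t + 1`) and all function names are assumptions made for this example, not an existing solver format.

```python
from itertools import product

def frequency_pb(D, theta):
    """One PB constraint per item: theta * (not I_i) + sum of covering T_t >= theta."""
    m, n = len(D), len(D[0])
    constraints = []
    for i in range(n):
        terms = [(theta, -(i + 1))]  # weighted literal: theta * (not I_i)
        terms += [(1, n + t + 1) for t in range(m) if D[t][i] == 1]
        constraints.append((terms, theta))
    return constraints

def pb_holds(constraint, assignment):
    """Evaluate sum(coef * [literal is true]) >= bound."""
    terms, bound = constraint
    total = sum(c for c, lit in terms if assignment[abs(lit)] == (lit > 0))
    return total >= bound

def reified_holds(D, theta, i, assignment):
    """Equation (8) for item i: I_i = 1 implies sum_t T_t * D[t][i] >= theta."""
    m, n = len(D), len(D[0])
    if not assignment[i + 1]:
        return True
    return sum(D[t][i] for t in range(m) if assignment[n + t + 1]) >= theta

def encodings_agree(D, theta):
    """Compare both formulations on every assignment of the n + m variables."""
    m, n = len(D), len(D[0])
    pbs = frequency_pb(D, theta)
    for bits in product([False, True], repeat=n + m):
        assignment = {v + 1: bits[v] for v in range(n + m)}
        for i in range(n):
            if pb_holds(pbs[i], assignment) != reified_holds(D, theta, i, assignment):
                return False
    return True

if __name__ == "__main__":
    D = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
    print(frequency_pb(D, 2))
    print(encodings_agree(D, 2))
```

The check is exhaustive over $2^{m+n}$ assignments, so it is only meant for very small matrices.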

3.2 Enumeration Options

Using SAT solvers on the previously defined encoding would output either one
frequent itemset ($I_i$ literals assigned as true) or unsat, meaning that no itemset
satisfies the constraints. Understandably, model enumeration needs to be present
in order to solve the FIM problem, that is, to find the set of all frequent itemsets.

Note that an alternative strategy to define all frequent itemsets at the
encoding level would mean an exponential growth of the search space, potentially
behaving similarly to a simple enumeration strategy. However, since we can easily
opt to adopt more expressive ways of performing enumerations, with significant
cuts to the search space on each iteration, we only target enumeration strategies.

For this purpose, we need to introduce two properties of frequency that allow
for pruning substantial parts of the search space, and a dual formulation for FIM.

Definition 8. Let $I$ be a set of items. A transaction $(tid, J)$ contains $I$,
denoted $I \subseteq (tid, J)$, if $I \subseteq J$. FIM approaches rely heavily on:

- monotonicity of frequency: if $I \subseteq J$, then the frequency of $J$ is
bounded from above by the frequency of $I$;

- anti-monotonicity of frequency: if $I \subseteq J$ and $I$ is not frequent,
then $J$ is also not frequent.

Definition 9. Given an itemset database $D$ and a minimum support threshold
$\theta$, the dual-FIM task centered on non-frequent itemsets is to compute:

$2^{\mathcal{I}} \setminus \{I \mid I \subseteq \mathcal{I},\ sup_D(I) < \theta\}$ (10)

Table 2 relies on these notions to define different enumeration options. Explicit
and compact negations based on the (anti-)monotonic property are proposed, as
well as further directions relying on alternative frequency-based properties.

3.3 Search Options 

Previous enumeration strategies may not result in significant improvements if
SAT iterations do not put any guarantee on the granularity of the itemsets
found. For instance, if the subsets negation strategy is adopted and the solver
tends to output finer itemsets, the adoption of this strategy will not be relevant.
The same is valid for the supersets negation strategy. To describe these search
options, a new concept needs to be introduced.

Definition 10. A maximal frequent itemset is a frequent itemset $I$ that also
satisfies:

$\forall I' \supset I : |\varphi_D(I')| < \theta$ (11)

All itemsets that are a superset of a maximal itemset are infrequent, while
all itemsets that are subsets are frequent. Maximal frequent itemsets are the top
border between itemsets that are frequent and not frequent.

In an itemset database where ABCD is the only maximal frequent itemset, a
SAT solution using subsets negation may either find all frequent itemsets within
1 iteration or across $\sum_{i=1}^{4} \binom{4}{i} = 15$ iterations. The following
three search options guarantee an upper bound on the number of performed iterations.

Option | Description | Example ($\mathcal{I} = \{A, B, C, D\}$)

Option: Simple
Description:
  Observation. Adding a clause with the negation of the solution on every
  iteration hampers the search, as it may result in redundant searches with
  multiple assignments of transaction-based literals $T_t$ for the same itemset.
  Method. The added clause must only include the negation of item-based
  literals $I_i$ (i.e. exclude all transaction-based literals $T_t$).
Example:
  $sup(\{A, C\}) \geq \theta$:
  $(\neg A \vee B \vee \neg C \vee D)$

Option: Subsets Negation
Description:
  Observation 1. If an itemset is frequent, then its subsets are also frequent
  (monotonicity).
  Method 1. Add the negation of the found itemset as well as the negation
  of its subsets, so the number of SAT iterations can be largely reduced.
  Observation 2. Negating every frequent itemset may result in an impractical
  growth of the number of clauses during the search. For a very simple
  itemset database with $mn \approx 100$, the number of added clauses can reach
  a thousand (significantly higher than the initial number of clauses used to
  encode the problem).
  Method 2. Compact the set of clauses obtained within one iteration into only
  one clause. That is, in the next run the solver must be able to select at least
  one item that is not included in the previously found frequent itemset:
  $I \Rightarrow \bigvee_{i \notin I} I_i$
Example (explicit):
  $sup(\{A, C\}) \geq \theta$:
  $(\neg A \vee B \vee \neg C \vee D)$
  $\wedge\, (\neg A \vee B \vee C \vee D)$
  $\wedge\, (A \vee B \vee \neg C \vee D)$
Example (compact):
  $sup(\{A, C\}) \geq \theta$:
  $(B \vee D)$

Option: Supersets Negation
Description:
  Observations. For fixed $I_i$ literal attributions satisfying the coverage
  constraints, if no combination of $T_t$ satisfies the frequency constraints, new
  clauses can be directly learned corresponding to the supersets of the found
  non-frequent itemset. This property was found to be critical for a dual-FIM
  formulation.
  Methods. Adopt the dual-FIM problem for medium-to-low frequency thresholds,
  and negate supersets in a similar fashion as in subsets negation
  (items of non-frequent itemsets cannot jointly appear):
  $I \Rightarrow \bigvee_{i \in I} \neg I_i$
  Note that the choice of when to adopt the FIM-dual should be dynamically
  made based on the dataset properties (mainly density, but also the
  transactions-to-items ratio) and on the inputted frequency.
Example (explicit):
  $sup(\{A, C\}) < \theta$:
  $(\neg A \vee B \vee \neg C \vee D)$
  $\wedge\, (\neg A \vee \neg B \vee \neg C \vee D)$
  $\wedge\, (\neg A \vee B \vee \neg C \vee \neg D)$
  $\wedge\, (\neg A \vee \neg B \vee \neg C \vee \neg D)$
Example (compact):
  $sup(\{A, C\}) < \theta$:
  $(\neg A \vee \neg C)$

Option: Others
Description:
  Pointer 1. Adopt an advanced enumeration strategy centered on implicit
  methods such as, for instance, cube representations [29].
  Pointer 2. Use the monotonicity and anti-monotonicity properties to affect
  the exploitation of the structure of conflicts within a SAT solver.
  Pointer 3. Exploit important relationships between itemset frequencies
  besides monotonicity. These properties should not only affect the iterative
  insertions, but can also be included as constraints in the initial encoding.
  This may result in improvements as significant as those of the previous
  strategies. For example, in the MAXMINER algorithm [7], relations of the
  following form are exploited:
  $freq(\{a, b, c\}) \geq freq(\{a, b\}) + freq(\{a, c\}) - freq(\{a\})$.
  There are many more relations between the frequencies of itemsets. See
  [13] for extensions based on the inclusion-exclusion principle. For a
  generalization to other measures besides frequency, see [35].

Table 2: Enumeration Options
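
The difference between the simple and the compact subsets negation options can be sketched with a small stand-in for the SAT solver. Everything below is a hypothetical illustration: the database is constructed so that {A, B, C, D} is its only maximal frequent itemset, `solve` replaces a real SAT call by exhaustive search, the empty itemset is excluded as a simplifying assumption, and the `largest_first` flag only mimics a solver that tends to return larger or finer itemsets first.

```python
from itertools import combinations

ALPHABET = ["A", "B", "C", "D"]
# Hypothetical database in which {A, B, C, D} is the only maximal frequent itemset.
DB = [{"A", "B", "C", "D"}, {"A", "B", "C", "D"}]
THETA = 2

def support(itemset):
    return sum(1 for t in DB if itemset <= t)

def clause_ok(clause, itemset):
    """A clause is (positive items, negated items) over the item literals only."""
    pos, neg = clause
    return bool(pos & itemset) or bool(neg - itemset)

def solve(clauses, largest_first):
    """Stand-in for one SAT call: a non-empty frequent itemset or None (unsat)."""
    n = len(ALPHABET)
    sizes = range(n, 0, -1) if largest_first else range(1, n + 1)
    for k in sizes:
        for combo in combinations(ALPHABET, k):
            s = frozenset(combo)
            if support(s) >= THETA and all(clause_ok(c, s) for c in clauses):
                return s
    return None

def simple_block(itemset):
    """Simple option: negate exactly the found itemset."""
    return (frozenset(ALPHABET) - itemset, itemset)

def subsets_block(itemset):
    """Compact subsets negation: require at least one item outside the itemset."""
    return (frozenset(ALPHABET) - itemset, frozenset())

def enumerate_models(block, largest_first):
    clauses, found = [], []
    while (s := solve(clauses, largest_first)) is not None:
        found.append(s)
        clauses.append(block(s))
    return found

if __name__ == "__main__":
    print(len(enumerate_models(simple_block, True)))
    print(len(enumerate_models(subsets_block, True)))
    print(len(enumerate_models(subsets_block, False)))
```

In this sketch the simple option always needs one iteration per frequent itemset, whereas the compact subsets negation needs a single iteration when the largest itemset is returned first and degrades to one iteration per itemset when the finest itemsets are returned first.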



Largest-to-Shortest Maximal (LSM) search
This search option guarantees that the number of iterations equals the number
of maximal frequent itemsets by mapping the previous decision problem into
an optimization problem. Each iteration returns a maximal frequent itemset,
starting with the longest maximal frequent itemset until reaching the shortest
maximal frequent itemset. This is done by defining the following goal function:

$\min : \sum_{i \in \mathcal{I}} \neg I_i$ (12)

and, understandably, by adopting one of the expressive enumeration strategies
based on the (anti-)monotonicity property. Otherwise, the goal of finding
maximal frequent itemsets would no longer be a valid effort.
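
The LSM loop can be sketched as follows. This is a minimal illustration under stated assumptions: the optimization call is replaced by exhaustive search from the largest size downwards, the compact subsets negation is used as the enumeration option, and the database and function names are hypothetical.

```python
from itertools import combinations

def lsm_search(db, theta):
    """Return maximal frequent itemsets, longest first (brute-force stand-in)."""
    alphabet = sorted(set().union(*db))
    full = frozenset(alphabet)

    def support(itemset):
        return sum(1 for t in db if itemset <= t)

    clauses, maximal = [], []
    while True:
        best = None
        # Minimising the number of false item literals = trying the largest size first.
        for k in range(len(alphabet), 0, -1):
            for combo in combinations(alphabet, k):
                s = frozenset(combo)
                # Each learned clause requires one item outside a previous solution.
                if support(s) >= theta and all(c & s for c in clauses):
                    best = s
                    break
            if best is not None:
                break
        if best is None:  # unsat: every frequent itemset is already covered
            return maximal
        maximal.append(best)
        clauses.append(full - best)  # compact subsets negation

if __name__ == "__main__":
    db = [{"A", "B", "C"}, {"A", "B", "C"}, {"C", "D"}, {"C", "D"}, {"A", "D"}]
    print(lsm_search(db, 2))
```

On this toy database the loop performs two successful iterations, one per maximal frequent itemset, before the third call reports unsat.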

Constrained Monotonic Growing (CMG) search
One of the undesirable consequences of the LSM search method is the need to
fully exploit the search space within each iteration. That is, to obtain an
optimum value for the goal function, several (potentially maximal) frequent
itemsets need to be incrementally compared. Additionally, most of the learned
behavior needs to be restarted across iterations. This is particularly critical
for itemset databases with either many maximal frequent itemsets (usually the
case) or with multiple fine maximal frequent itemsets.

In order to overcome the referred problems, we propose CMG, a more relaxed
search option formulated over decision tasks that does not require the larger
maximal frequent itemsets to be found early. CMG is based on two types
of searches: α-search and β-search. An α-search is a simple SAT iteration
(outputting one frequent itemset). As a result, not only is a clause expressively
negating its subsets added, but also the itemset itself:

temporaryClause ← $\bigvee_{i \mid i \notin I} I_i$

After an α-search, a set of β-searches is performed with the goal of finding
larger frequent itemsets until a maximal frequent itemset is found (with unsat
being returned). When this happens, the clauses related to the previous itemset's
items are removed, the found frequent itemset is expressively negated and a new
α-search is performed. This behavior is repeated until the α-search returns unsat.
An illustrative instantiation segment of CMG behavior for an itemset database
with $\mathcal{I} = \{A, B, C, D, E\}$ is presented next:

    {A} ⊣ α-search()
    temporaryClause ← (B ∨ C ∨ D ∨ E)
    {A, B, E} ⊣ β-search({A, ?})
    temporaryClause ← (C ∨ D)
    unsat ⊣ β-search({A, B, E, ?})
    learnedClauses ← learnedClauses ∧ temporaryClause
    temporaryClause ← ∅
    {C, E} ⊣ α-search()
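
The alternation between α- and β-searches can be sketched as below. This is only an illustration under stated assumptions: `sat_call` is a hypothetical exhaustive stand-in for a SAT call that returns the smallest admissible itemset first, the database is a toy example, and the temporary clause is passed as an argument instead of being added to and removed from a solver.

```python
from itertools import combinations

def cmg_search(db, theta):
    """Return (maximal frequent itemsets, trace of alpha/beta results)."""
    alphabet = sorted(set().union(*db))
    full = frozenset(alphabet)

    def support(itemset):
        return sum(1 for t in db if itemset <= t)

    def sat_call(learned, fixed, temporary):
        """Stand-in for one SAT call; returns a frequent itemset or None (unsat)."""
        for k in range(1, len(alphabet) + 1):
            for combo in combinations(alphabet, k):
                s = frozenset(combo)
                if (fixed <= s and support(s) >= theta
                        and all(c & s for c in learned)
                        and (temporary is None or temporary & s)):
                    return s
        return None

    learned, maximal, trace = [], [], []
    while True:
        current = sat_call(learned, frozenset(), None)  # alpha-search
        if current is None:
            return maximal, trace
        trace.append(("alpha", current))
        while True:
            temporary = full - current  # force at least one additional item
            bigger = sat_call(learned, current, temporary)  # beta-search
            if bigger is None:  # unsat: current is maximal
                break
            trace.append(("beta", bigger))
            current = bigger
        maximal.append(current)
        learned.append(full - current)  # keep the last temporary clause

if __name__ == "__main__":
    db = [{"A", "B", "C"}, {"A", "B", "C"}, {"C", "D"}, {"C", "D"}, {"A", "D"}]
    found, steps = cmg_search(db, 2)
    print(found)
    for kind, itemset in steps:
        print(kind, sorted(itemset))
```

Unlike the LSM sketch, no call has to prove optimality: each β-search only asks for some strictly larger frequent itemset, and the clause that is finally learned is the temporary clause of the last, unsatisfiable β-search.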

Length Decreasing (LD) search 

A third search option, LD search, benefits from a more focused search of the
space as it fixes the length of the itemsets to be found. To guarantee that only
maximal frequent itemsets are selected, LD initially fixes this length to $n$ and
iteratively decrements it. Alternative length settings are possible if a separate
initial scan of the itemset database guarantees upper and lower bound
restrictions on the length of maximal frequent itemsets.

LD accomplishes this behavior by adding and removing equalities of the
form $\sum_{i} I_i = k$, with $k \in \{1, .., n\}$. However, since only a few solvers
support the addition and removal of pseudo-Boolean constraints, a new set of
variables $A = \{A_1, .., A_n\}$ is added into the following additional constraint:

$\sum_{i} I_i + \sum_{i} A_i = n$ (13)

In the simplest mode, all $A_i$ variables are initially assigned to false until no
maximal frequent itemset with length $n$ is found (unsat output). Incrementally,
each $A_i$ is reversed to true, so finer maximal frequent itemsets can be found.
Understandably, either the subsets negation enumeration option or the encoding
of equation (11) needs to be in place for an efficient search.
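
The LD strategy can be sketched by fixing the itemset length and decrementing it. The snippet is a minimal illustration under stated assumptions: the length equality is enforced by enumerating combinations of a given size rather than through the auxiliary $A_i$ variables, the compact subsets negation is used, and the database and names are hypothetical.

```python
from itertools import combinations

def ld_search(db, theta):
    """Return maximal frequent itemsets by decreasing fixed length k = n, .., 1."""
    alphabet = sorted(set().union(*db))
    full = frozenset(alphabet)

    def support(itemset):
        return sum(1 for t in db if itemset <= t)

    learned, maximal = [], []
    for k in range(len(alphabet), 0, -1):  # plays the role of sum_i I_i = k
        for combo in combinations(alphabet, k):
            s = frozenset(combo)
            # Subsets of previously found itemsets violate a learned clause,
            # so every itemset accepted here is maximal.
            if support(s) >= theta and all(c & s for c in learned):
                maximal.append(s)
                learned.append(full - s)
    return maximal

if __name__ == "__main__":
    db = [{"A", "B", "C"}, {"A", "B", "C"}, {"C", "D"}, {"C", "D"}, {"A", "D"}]
    print(ld_search(db, 2))
```

Because all longer frequent itemsets are found before length $k$ is considered, any itemset of length $k$ that survives the learned clauses cannot have a frequent superset.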



3.4 Mapping Restrictions 

The need to test the proposed options against different pseudo-Boolean solvers
may require adaptations over the initial encoding. For simplicity, this section
only describes how to support the removal of clauses during enumeration. In the
appendix, additional adaptations to deal with non-negated variables are covered.



Encoding | Description

Encoding: Incremental Clauses
Description:
  Method. Resort to: i) $n$ additional variables ($N = \{N_1, .., N_n\}$),
  ii) $n$ additional clauses:

  $\bigwedge_{i \in \mathcal{I}} (I_i \vee N_i)$ (14)

  and iii) manipulations over the vector of assumptions that SAT solvers
  usually disclose. In α-searches the $N$ Boolean vector of assumptions is set to
  true, so these new $n$ clauses are directly satisfied. In β-searches, the indexes
  $i$ of the $I_i$ items belonging to the target itemset are fixed, and the
  respective $N_i$ variables are set to false. By resorting to $N_i$ assumptions,
  the SAT solver behavior is similar to that of solvers that allow for clause
  removal, and the level of performance is closely maintained.
  This is referred to as the incremental clauses encoding because, although no
  clauses need to be deleted, new clauses still need to be added.

Encoding: Fixed Clauses
Description:
  Observation. As the number of inserted clauses grows significantly with the
  number of iterations, an encoding with an increasing number of clauses
  penalizes the performance.
  Method. The strategy non-prone to insertions requires part of the reasoning to
  be done outside of the solver. The challenge is that all new clauses are
  relevant and, thus, need to be maintained in memory. For instance, in an
  $n = 5$ itemset database, if the solver finds in initial iterations
  $\{I_1 I_2 I_3\}$ and $\{I_1 I_3 I_5\}$ (i.e. the learned clauses are,
  respectively, $I_4 \vee I_5$ and $I_2 \vee I_4$), the new solutions need to
  satisfy all the learned clauses. In this strategy, the storage and reasoning
  are done separately to affect the values of the $N$ Boolean vector. Following
  the introduced example, either $N_4$ or $N_2$ will iteratively assume the
  value false. So the binary clauses to be satisfied, described in the previous
  option, will guarantee that the respective items will be true.
  This method of defining assumptions needs, however, to be complemented with a
  control variable $y$. This control variable is required for the CMG search
  option to distinguish between the α- and β-searches. Additionally, the
  following constraint needs to be satisfied:

  $y + \sum_{i} I_i + \sum_{i} N_i \geq n + 1$ (15)

  to guarantee that in β-searches ($y$ assumed to be false) an additional item is
  selected.

Table 3: Encoding Options for Clause Removals

The majority of available pseudo-Boolean solvers do not support the dynamic
insertion or removal of clauses. Since the insertion of clauses is critical, when a
solver of interest does not allow for insertion, its implementation needs to be
adapted (usually by making invocations to SAT solver methods visible at the
level of the pseudo-Boolean solver interface). Although the removal of clauses
is required within the CMG and LD search options, it is rarely allowed and is not
easily exposed through the interface of SAT solvers. Two encoding
adaptations to deal with removals are depicted in Table 3.

3.5 Tuning Options

Although multiple options for an efficient FIM were introduced, additional
improvements can be performed either by adapting the initial encoding or the
solver behavior.

Constraints Reduction

From the multiple encoding adaptations that were studied, only one resulted in
a significant performance improvement (for medium-to-high support thresholds).
This adaptation is centered on a FIM property that can lead to the reduction of
the initial $O(mn)$ constraints or binary clauses (if the solver is able to clausify
all of these constraints) into only $O(m + n)$ constraints. With
$A_t = \{i \in \mathcal{I} \mid D_{ti} = 0\}$:

$\bigwedge_{t \in \mathcal{T}} \left(\neg T_t \vee \bigwedge_{i \in A_t} \neg I_i\right)$

$\iff \bigwedge_{t \in \mathcal{T}} \left(\bigwedge_{i \in A_t} (\neg T_t \vee \neg I_i)\right)$

$\iff \bigwedge_{t \in \mathcal{T}} \left(|A_t|\, \neg T_t + \sum_{i \in A_t} \neg I_i \geq |A_t|\right)$

Polarity Suggestions and Parameters 

A simple and effective way to adapt the solver behavior is to change the
polarity suggestions. Since we are interested in the early finding of larger
itemsets, polarity suggestions for $I_i$ variables should be set to positive (this
only degrades performance when maximal frequent itemsets have a very fine length).
Orthogonally, the polarity suggestions for $T_t$ variables depend on a wide variety
of factors (such as the given frequency, density, iteration step and search
strategy) and, therefore, can be dynamically attributed in a scope-sensitive manner.

Additionally, solver parameters such as the decay factors for variable activity and
clause activity can be dynamically tuned on the basis of sensitivity analysis.

Finally, the solver resolution can be adapted to be, for instance, sensitive to
the difference between the transaction and item literals in a way that promotes
a more focused search. This can result in significant performance improvements.
Note, however, that the solver functionality must be ensured to support the
addition of new flexible constraints.



4 Results 

This section details the undertaken evaluation of the previous options against
state-of-the-art CP solutions. First, we visit the properties of our
implementation, then we describe the most significant observations, and, finally,
we discuss the results to retrieve a set of implications.

Covered Options

The supported encodings are the target FIM encoding and its dual formulation.
The simple and expressive subsets and supersets negation are the covered
enumeration options. The modeling of (anti-)monotonicity properties at the encoding
level is not supported since it implies an exponential growth in the number of
variables and constraints. All the advanced search options (simple, LSM, CMG
and LD) are supported as well. Additionally, all the restrictions introduced were
found when linking some solvers and, therefore, they are addressed in our
experiments. Finally, an extensive set of tuning options was implemented, with
the most significant being the ones described in the previous section.

Codification Alternatives 

Different codifications were defined with the goal of supporting an efficient and 
flexible interface with multiple pseudo-Boolean solvers. For instance, a codifi- 
cation in Java can only interface efficiently with solvers in Java through direct 
invocation. Otherwise, interaction needs to be done between executablcs (the de- 
veloped layer and the solver), leading to an additional latency as a result of the 
required synchronization between them. Additionally, the exchanged information 
requires parsing. This hampers the performance as information is extensively ex- 
changed in every iteration (note that the number of iterations is usually greater 
than for low frequencies). 

Two classes of SAT solvers were adopted. The first class comprises the
open-source solvers, for which all the covered options were implemented. The
second class includes the solvers with undisclosed source, for which a simpler
goal was assessed: to see how they perform on specific single iterations against
the alternatives. This restriction is justified by the fact that most of them
support neither the use of assumptions nor the insertion of new clauses, both
required to perform enumerations. Although fully feeding these solvers with the
encoding at every iteration was tried, the fact that they do not keep the
learned clauses in memory made their performance impractical. These solvers are
PBS [2], BSOLO [28] and WBO [27].

The adopted solvers belonging to the first class are SAT4J [26] and
MiniSat+ [19]. Note that, since SAT4J is implemented in Java and MiniSat+ is
implemented in C++, two codifications for the target solution were supported:
under Java and C++^.

Datasets 

The adopted datasets were taken from the UCI repository^. The density of 
the dataset is defined by the average number of items per transaction divided 
by the size of the items' alphabet. Although the selected datasets are not large 
(note that optimal approaches suffer from scalability problems), they are dense 
by nature and, therefore, their use within traditional approaches is still largely 
computationally expensive. 
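
The density definition above can be made concrete with a minimal sketch. It assumes a dataset is held as a list of item sets; the function name `density` and the toy data are illustrative only and are not part of the released codifications.

```python
def density(transactions, alphabet_size=None):
    """Average number of items per transaction divided by the alphabet size."""
    if not transactions:
        raise ValueError("empty dataset")
    if alphabet_size is None:
        # Assumption: the alphabet is the set of items that occur at least once.
        alphabet_size = len(set().union(*transactions))
    avg_len = sum(len(set(t)) for t in transactions) / len(transactions)
    return avg_len / alphabet_size


# Toy dataset (hypothetical): 4 transactions over the alphabet {a, b, c, d}.
toy = [{"a", "b", "c"}, {"a", "c"}, {"b"}, {"a", "b", "c", "d"}]
print(density(toy))  # (3 + 2 + 1 + 4) / 4 / 4 = 0.625
```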



^ web.ist.utl.pt/rmch/research/software. An additional third codification is available in C# upon request.
^ http://archive.ics.uci.edu/ml/
4.1 Observations

The computer used to run the experiments was an Intel Core i5 2.80GHz with
6GB of RAM. The algorithms were implemented using Java (JVM version 1.6.0-24)
and C++ (GCC 4.4.5) on a 64-bit Linux (Debian 2.30.2) operating system.

Comparative analysis

Table 4 synthesizes the main results over UCI datasets for the tuned MiniSat+
and SAT4J implementations under the CMG search option and for the
state-of-the-art CP performer, FIMCP. The proposed SAT-based solutions have a
phase transition that relaxes for very low and medium-to-high frequency
thresholds, as illustrated in Fig. 1. In contrast, the behavior of FIMCP
increasingly deteriorates as the frequency threshold decreases.

Table 4: Overall efficiency of the proposed solvers against FIMCP (seconds)
*timeout; ^memory out



FIMCP is the best option when targeting either medium-to-high frequency
thresholds or datasets of normal-to-low density. However, when the target
problem is based on low frequency thresholds over dense or large datasets, our
adapted MiniSat+ solution is the choice. Note that these cases are the most
common scenario in FIM problems, where the frequency range required to derive
association rules falls between 1 and 2%. An illustrative example of this
advantageous behavior can be observed over the hepatitis dataset. The
performance of FIMCP for this dataset is only scalable for frequency thresholds
near and above 20%. For θ < 10%, FIMCP performance deteriorates exponentially
as θ decreases. Interestingly, under this same θ range, MiniSat+ is able to
answer the FIM problem in useful time, as shown in Fig. 2.

Fig. 1: Phase transitions for varying densities of generated datasets (using MiniSat+)

Fig. 2: MiniSat+ behavior for Hepatitis dataset under very low frequencies
(plot of time in seconds against frequency for MiniSat+ and FIMCP)

Since the performance of SAT4J is hampered by its resolution properties (not
tuned to deal with low frequency thresholds and not able to clausify key
constraints) and by poor memory management (dependent on a garbage collector),
the adoption of MiniSat+ or FIMCP is overall preferred.



Selecting SAT-based solvers 

The selection of the best performer based on the input frequency and
dataset properties requires an extensive analysis of the behavior of the solvers
across different axes of choice. Table 5 synthesizes the main results of the
undertaken analysis, which are further detailed in Appendix C.

4.2 Deepening the analysis behind SAT vs. CSP 

Up to now, the most efficient CP approaches map FIM as a Constraint
Satisfaction Problem (CSP) [34]. A CSP is specified by a finite set of variables
V, an initial domain D (which maps every variable v ∈ V to a finite set of values
D(v)), and a finite set of constraints C in first-order logic. The goal is to
output the variable domains which satisfy all constraints. Thus, the solution to
a FIM problem can be directly retrieved from the codification of equations (5)
and (8).
Axis: Enumeration options

o Although the negation of subsets or supersets within iterations leads to
significant performance improvements, the level of impact depends on the ability
of the search to discover the largest itemsets early (as in LSM, CMG and LD);

o Interestingly, the explicit negation of each subset/superset is preferred over
the expressive one-clause-only negations when there is a high number of
iterations, as solvers are able to remove duplicated negated subsets, while in
the latter option there is an increasing redundancy that may hamper the search
performance.

Axis: Search options

o CMG search is overall the best choice (β-searches are efficient);

o Simple search methods are only able to perform searches with acceptable
efficiency on high frequency thresholds over very sparse datasets, and should
only be adopted when resolution promotes an early discovery of larger itemsets;

o LSM searches are only competitive when a small number of iterations is
performed, since each search is a lot heavier than a full CMG search (including
one α-search and multiple β-searches);

o LD is competitive with CMG for medium frequency thresholds (0.05 < θ < 0.2) on
small datasets. LD performance quickly degrades with their growing size.
Although sat searches are light (focused on discovering frequent itemsets with a
given length), a significant overhead is added by the fixed number of heavier
unsat searches, particularly for a low-to-medium number of maximal frequent
itemsets.

Axis: Encoding options (restrictions)

o Clause-oriented encodings seem to be preferred over minimal encodings for
lower-to-medium frequency thresholds. Although minimal encodings have
significantly fewer constraints, they are not easy to clausify;

o Restricted encodings penalize the performance significantly (5-25%).

Axis: Tuning options

o Positive suggestions are adequate for low-to-medium frequencies (they indicate
a preference towards larger itemsets), while negative suggestions are indicated
for higher frequencies (since large itemsets are not frequent, more conflicts
are found, increasing the number of backtracks and potential restarts);

Axis: Implementation options

o SAT4J solver is the option for high frequency thresholds and for very sparse
datasets, since it accepts non-constrained encodings and its resolution is more
tuned to find finer itemsets;

o MiniSat+ solver is the natural choice for the rest of the options due to,
among others, the tuned parameterizations such as polarity suggestions and
decaying factors.

Table 5: Key observations for the target SAT-based solvers across six dimensions

The first key concept used to speed up the search is constraint propagation to 
reduce the domains of variables such that the domain remains locally consistent. 
To maintain local consistencies, propagators are used to remove values from a 
domain that can never satisfy a constraint. 

Besides this property, constraint-based solvers are well-prepared to deal with
certain types of constraints. Two examples are the summation constraint and the
reified summation constraint [22], which flexible solvers such as Gecode handle
well. The FIM problem heavily relies on these constraints.

Let $x \in V$, with $V$ a subset of the problem variables, be a variable with an
associated weight $w_x$. A summation constraint has the following form [22]:

$$\sum_{x \in V} w_x x \geq \theta \qquad (16)$$

The propagator task is to discover as early as possible whether the constraint is
violated (i.e. whether the upper bound of the sum is still above the threshold).

In a reified summation constraint, the evaluation of a summation constraint
depends on a Boolean variable $b$ (as adopted in the target frequency constraints):

$$b \leftrightarrow C \quad \Big(\text{usually } C \equiv \sum_{x \in V} w_x x \geq \theta\Big) \qquad (17)$$

The most important propagation that occurs for this constraint is the one that
updates the domain of $b$ [22]. When an item variable is fixed, the following is
possible for the coverage constraint, where $I_i^{min}$ and $I_i^{max}$ denote
the smallest and largest values in the domain $D(I_i)$:

if for some $t$: $\sum_{i \in I} (1 - D_{ti})\, I_i^{min} > 0$ then remove 1 from $D(T_t)$

if for some $t$: $\sum_{i \in I} (1 - D_{ti})\, I_i^{max} = 0$ then remove 0 from $D(T_t)$

Once the domain of a variable $T_t$ changes, the support constraint is activated.

The support constraint is simply a summation constraint, which checks whether
$\sum_{t \in T} T_t^{max} \geq \theta$. If this check fails, CSP solvers do not
need to branch further and, therefore, can backtrack.
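
The two coverage propagation rules and the support check can be illustrated with a small sketch. It is not the propagator of any particular CP solver: domains are represented as Python sets over {0, 1}, `data[t]` is assumed to hold the items of transaction t, and θ is taken as an absolute count. All function names are hypothetical.

```python
def propagate_coverage(item_dom, trans_dom, data):
    """Apply the two coverage rules once; returns True if a domain changed.

    An empty transaction domain after propagation would signal a failure.
    """
    changed = False
    n_items = len(item_dom)
    for t, items_in_t in enumerate(data):
        absent = [i for i in range(n_items) if i not in items_in_t]
        # Rule 1: some item absent from t is fixed to 1, so t cannot be covered.
        if any(min(item_dom[i]) == 1 for i in absent) and 1 in trans_dom[t]:
            trans_dom[t].discard(1)
            changed = True
        # Rule 2: no item absent from t can still be selected, so t is covered.
        if all(max(item_dom[i]) == 0 for i in absent) and 0 in trans_dom[t]:
            trans_dom[t].discard(0)
            changed = True
    return changed


def support_can_hold(trans_dom, theta):
    """Support check: the upper bound of the sum of T_t must reach theta."""
    return sum(max(d, default=0) for d in trans_dom) >= theta


# Toy database (hypothetical): 4 transactions over items 0..2.
data = [{0, 1}, {0, 1, 2}, {1, 2}, {0}]
item_dom = [{0, 1}, {0, 1}, {1}]          # item 2 has just been fixed to 1
trans_dom = [{0, 1} for _ in data]
propagate_coverage(item_dom, trans_dom, data)
print(trans_dom)                          # [{0}, {1}, {0, 1}, {0}]
print(support_can_hold(trans_dom, 3))     # False: the solver can backtrack
```

In the toy run, fixing item 2 excludes transactions 0 and 3, so at most two transactions can still be covered and any threshold above 2 fails without further branching.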

In contrast, the mapping of the reified frequency constraint into SAT is not
concisely handled, leading to the generation of nm clauses. Together, these
observations and the fact that SAT solvers are not able to expressively deal
with enumerations (in particular, when an arbitrary number of clauses is added
between iterations) justify the poor performance of the developed solutions
across several datasets and frequencies.

4.3 Discussion 

From the previous observations, several implications with impact on when to 
use and how to tune SAT-based solutions can be retrieved. 

When to use SAT-based solutions: 

— for low frequency thresholds, with most promising results on dense datasets
(for instance, datasets typically adopted for classification tasks with nominal
attributes with few labels, or with numeric attributes that are binarized using
thresholds). The frequency level below which a SAT-based solution should be
chosen depends on the density: if density is near 45-50% the frequency can reach
10%, while for other cases the frequency should not exceed 4%;

— when the problem is not defined as a complete enumeration, but aims to find 
a fixed number of patterns of interest, or to verify its satisfiability; 

— in specific cases for higher frequencies (mainly between 10% and 25%) when 
the FIM problem is formulated as its dual; 
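
The first two guidelines above can be read as a rough decision rule. The sketch below is only one possible reading: the interval used for "near 45-50%" is an assumption, the dual-formulation case is left out because the source describes it only as applying "in specific cases", and the function name is hypothetical.

```python
def prefer_sat(density, theta, complete_enumeration=True):
    """Rough reading of the 'when to use SAT' guidelines (not a definitive rule).

    density and theta are fractions in [0, 1].
    """
    if not complete_enumeration:
        # Fixed number of patterns, or a pure satisfiability check.
        return True
    # Assumption: "near 45-50%" is read as the closed interval [0.45, 0.50].
    limit = 0.10 if 0.45 <= density <= 0.50 else 0.04
    return theta <= limit


print(prefer_sat(density=0.48, theta=0.08))   # True
print(prefer_sat(density=0.20, theta=0.08))   # False
```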

How to tune SAT-based solutions: 

— use an expressive enumeration strategy such as the compact subsets negation, 
change polarity suggestions (item-related variables to positive and transaction- 
related variables to false) and prefer clause-oriented encodings; 

— adopt the search strategy according to the target instances and problem:

  - by default and for the majority of cases, CMG search is the most efficient;

  - simple search should only be adopted when the polarity suggestions can
    be set according to the proposed guidelines for datasets with a good
    distribution of frequent itemsets among transactions (otherwise the use
    of polarity suggestions does not guarantee that, within each iteration,
    large itemsets are discovered);

  - LSM search should be selected when we are interested in a subset of
    itemsets with major interest (i.e. when there is the requirement of
    obtaining the k largest maximal frequent itemsets). In particular, the
    value of k should be significantly lower than the total number of maximal
    frequent itemsets (phase transition); otherwise a CMG search should be
    preferred (although it needs to generate all maximal frequent itemsets,
    since it cannot guarantee that a given maximal frequent itemset is one
    of the k largest);

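— be able to translate the flexible user-defined constraints into a SAT formula,
by defining generic methods to translate equivalences and implications. Note
that the distinguishing feature of CP is that it provides general principles for
solving problems with any type of constraints. Although some streams claim
that this observation sets it apart from SAT solving [9], we show that this
is a simple step, as illustrated by the following constraint encodings:

  - $I^p = I^q \leftrightarrow \bigwedge_{i \in I} (I_i^p \leftrightarrow I_i^q)$

  - $i \in I^p \leftrightarrow I_i^p$

  - $I^p \setminus I^q = I^r \leftrightarrow \bigwedge_{i \in I} (I_i^r \leftrightarrow I_i^p \wedge \neg I_i^q)$

  - $I^p \cap I^q = I^r \leftrightarrow \bigwedge_{i \in I} (I_i^r \leftrightarrow I_i^p \wedge I_i^q)$

  - $I^p \cup I^q = I^r \leftrightarrow \bigwedge_{i \in I} (I_i^r \leftrightarrow I_i^p \vee I_i^q)$

  - $coverItems(I^1, .., I^k) \leftrightarrow \bigwedge_{i \in I} (\bigvee_{j \in 1..k} I_i^j)$

  - $coverTrans(I^1, .., I^k) \leftrightarrow \bigwedge_{t \in T} (\bigvee_{j \in 1..k} T_t^j)$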
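
As an illustration of the generic translation of equivalences mentioned in the last guideline, the sketch below turns the per-item equivalences used by the intersection and union constraints into CNF clauses. It uses the standard clausal form of r ↔ (p ∧ q) and r ↔ (p ∨ q) with DIMACS-style integer literals; the helper names and the variable-numbering callback `var` are hypothetical and not taken from the released codifications.

```python
from itertools import product


def iff_and(r, p, q):
    """CNF clauses for r <-> (p AND q); literals are non-zero integers."""
    return [[-r, p], [-r, q], [r, -p, -q]]


def iff_or(r, p, q):
    """CNF clauses for r <-> (p OR q)."""
    return [[-r, p, q], [r, -p], [r, -q]]


def intersection_clauses(n_items, var):
    """Clauses for I^p ∩ I^q = I^r; var(name, i) maps a variable to an integer."""
    clauses = []
    for i in range(n_items):
        clauses += iff_and(var("r", i), var("p", i), var("q", i))
    return clauses


def satisfied(clauses, assignment):
    """True if every clause has a literal made true by the assignment."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)


# Brute-force check that the clauses match the intended equivalences.
for r, p, q in product([False, True], repeat=3):
    a = {1: r, 2: p, 3: q}
    assert satisfied(iff_and(1, 2, 3), a) == (r == (p and q))
    assert satisfied(iff_or(1, 2, 3), a) == (r == (p or q))
```

Each item contributes three clauses, so the set constraints grow linearly with the size of the item alphabet.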

Unfortunately, both the proposed SAT-based solution and the other CP
solutions are characterized by a high computational complexity, and their
straightforward implementations are not applicable to large datasets. This
advocates the need for local learning, where transactions are partitioned and
multiple criteria can be used for the integration of the frequent patterns found
within each fragment (for instance, through voting techniques [32]).
Alternatively, the FIM constraints can be used to compute information gain
metrics [14], such as the entropy measure or its dual formulation, as a basis
for pruning techniques.

5 Related research 

The main research streams approaching PM within a CP framework can be
classified according to: i) the extent to which a user can define novel
constraints and combine them, and ii) the type of supported constraints. Within
the first axis, although approaches such as Patternist [10], Molfea [18] and
MusicDFS [37] support a predefined number of constraints, they do not allow for
the expressive definition of novel constraints, as FIMCP [17], PattCP [21] and
GeMini [30] do. In the following section the constraints covered by existing
approaches and variations to the FIM problem are briefly presented. Finally,
other related work with potentially relevant contributions is covered.

Extending Constraints. 

Expressive CP models [30, 24, 21] enable the flexible definition of constraints
using: constants, including numerical values, items such as A, specific itemsets
such as {A, B}, and specific transactions such as tr; variables (the item
variables $I_i$, $i \in I$, and the transaction variables $T_t$, $t \in T$); set
and numerical operators (∪, ∩, \, ×, +, −); and function symbols involving one
or several terms, which can be built-in, as
overlapTrans(I, J) = |cover(I) ∩ cover(J)|, or defined by the user, as
area(I) = freq(I) × size(I) or coverage(I, J) = freq(I ∪ J) × size(I ∩ J).
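
The function symbols above can be evaluated directly on a transaction database. The following sketch assumes that freq is an absolute count and that a database is a list of item sets; the function names mirror the symbols in the text but the implementation is illustrative only.

```python
def cover(itemset, data):
    """Indices of the transactions that contain every item of the itemset."""
    return {t for t, items in enumerate(data) if itemset <= items}


def freq(itemset, data):
    """Assumption: absolute frequency (number of covering transactions)."""
    return len(cover(itemset, data))


def overlap_trans(i, j, data):
    return len(cover(i, data) & cover(j, data))


def area(i, data):
    return freq(i, data) * len(i)


def coverage(i, j, data):
    return freq(i | j, data) * len(i & j)


# Toy database (hypothetical).
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(area({"a", "b"}, db))                        # 2 transactions x 2 items = 4
print(overlap_trans({"a", "b"}, {"a", "c"}, db))   # only transaction 0 -> 1
print(coverage({"a", "b"}, {"a", "c"}, db))        # freq({a,b,c}) x |{a}| = 1
```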

These constraints can be used to filter the patterns of interest. However they 
can be used with a different purpose: closed itemset mining, discriminative pat- 
tern mining, pattern-based clustering and pattern-based classification. A recent 
direction is taking into account the relationships between local patterns to pro- 
duce global patterns or pattern sets [24]. Despite their importance, there are 
very few attempts to mine patterns involving several local patterns and the 
existing methods tackle just particular cases by using devoted techniques [22]. 
In [30, 9], the importance of adopting a declarative CP-based approach to mine 
global patterns was highlighted by several examples coming from clustering tasks 
based on associations. Due to the complexity of this task and its easy modeling 
through constraints, CP-based approaches as pattern teams [25] have been pro- 
viding encouraging results when compared against heuristic-based approaches 
that consider the added value of a new global pattern given extensive combina- 
tions of selected patterns [11]. 

Problem Variations. 

The introduced FIM approach can be extended and adapted in different ways. 
One of the most common is k-pattern set mining [21,30]. Unlike FIM, the
problem of k-pattern set mining is concerned with finding a set of k related
patterns under constraints. The discovery of k representative patterns often
uses probabilistic models for summarizing frequent patterns [31] and other
condensed representations of patterns, such as dataset compression using the
Minimum Description Length Principle [38]. These approaches mainly aim at
reducing the redundancy between patterns and, like our SAT approach, often focus
on maximal frequent patterns. The k-pattern set mining problem is a very general
problem
that can be instantiated to a wide variety of mining tasks including concept- 
learning, rule-learning, re-description mining, conceptual clustering and tiling 
[21,30]. 

Other Relevant Work. 

Instead of mining rules that rely on frequent itemsets, some approaches en- 
code the problem within the CP framework using (compact) reducts [5], i.e., 
subsets of most informative attributes. In this stream, a SAT representation is 
often formulated as an Integer Programming (IP) model to solve the minimal 
reduct constraints. In [32], an extensive research is done over Boolean reason- 
ing methodologies for Rough Set theory. Two problems are encoded: the search 
for reducts and the search for decision rules which are building units of many 
rule-based classification methods. 

Another important direction is verifying whether PM constraints are
satisfiable. This is a one-iteration-only decision problem that, according to
our previous analysis, can be handled using SAT solvers. An example of this
task, with multiple applications in privacy-preserving data mining, condensed
representations and the FIM problem, is the so-called FREQSAT problem [12]:
given some itemset frequency-interval pairs, does there exist a database such
that for every pair the frequency of the itemset falls into the interval? That
is, given a set of frequency constraints
$C = \{freq(I_j) \in [l_j, u_j],\ j = 1 \ldots m\}$, verify whether there exists
a database $D$ over $\bigcup_{j=1}^{m} I_j$ that satisfies $C$ ^. The problem can
be further extended to include arbitrary Boolean expressions over items and
conditional frequency expressions in the form of association rules.
Additionally, FREQSAT is equivalent to probabilistic satisfiability (pSAT) [14].
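
Deciding FREQSAT requires searching for a database, but checking a candidate witness is straightforward. The sketch below only performs that check, using the constraint set and the eight-transaction database given in the footnote example of this paper; the helper names are illustrative.

```python
from fractions import Fraction as F


def freq(itemset, db):
    """Relative frequency of an itemset in a list of transactions."""
    return F(sum(1 for t in db if itemset <= t), len(db))


def satisfies(db, constraints):
    """True if every itemset frequency falls into its [low, high] interval."""
    return all(lo <= freq(set(items), db) <= hi for items, lo, hi in constraints)


# Frequency constraints of the footnote example: (itemset, lower, upper).
constraints = [
    ("ab", F(3, 4), F(1)), ("ac", F(3, 4), F(1)), ("bc", F(3, 4), F(1)),
    ("de", F(3, 4), F(1)), ("df", F(1, 2), F(1)), ("ef", F(1, 2), F(1)),
    ("abcdef", F(0), F(0)),
]

# Witness database of the footnote example (transaction ids omitted).
db = [set("abcde")] * 3 + [set("abcdf"), set("abcef"),
                           set("abdef"), set("acdef"), set("bcdef")]

print(satisfies(db, constraints))  # True
```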

6 Concluding Remarks

The use of SAT within the CP framework to address constrained pattern mining
problems was demonstrated to be valid: not only are the adopted variables
Boolean, but expressive constraints can also be easily mapped into Boolean
formulae. Multiple search options, enumeration strategies, encoding alternatives
and parameterizations were studied in order to improve its performance. These
adaptations aim to orient SAT reasoning towards the main properties of frequent
itemset mining, without losing the ability to flexibly support novel
constraints.

Experimental results show that SAT-based solutions are competitive with the
state-of-the-art CSP solutions for an important range of frequencies (the range
commonly adopted to discover association rules or to perform classification
tasks). The efficiency problems found for higher frequencies mainly result from
the fact that SAT was not developed with the intent of performing enumerations
under the addition of new (and potentially conflicting) clauses. Finally, a set
of guidelines was introduced to understand when to use and how to tune SAT-based
solutions.

Future Work. In the next steps one should expect: 

— the extension of this approach to evaluate its impact not only in terms 
of efficiency but also in terms of accuracy for FIM-based problems as the 
discriminative-FIM problem; 

— the exploitation of SAT limits and performance in comparison to CP
approaches under an intensive use of constraints;

— the development of hybrid approaches with clear rules to select and tune
the best performer under certain conditions: i) the selected frequency, ii)
the nature of the problem constraints and relaxations, and iii) the dataset
properties - by order of relevance: the density, the number of items, the
item-to-transaction ratio and the number of transactions;

— the exploitation of potential improvements from the codification of the adopted 
enumeration and search strategies at the encoding level; 

^ Suppose that the following set C of frequency constraints is given:
{freq({a, b}) ∈ [3/4, 1], freq({a, c}) ∈ [3/4, 1], freq({b, c}) ∈ [3/4, 1],
freq({d, e}) ∈ [3/4, 1], freq({d, f}) ∈ [1/2, 1], freq({e, f}) ∈ [1/2, 1],
freq({a, b, c, d, e, f}) = 0}. C is in FREQSAT, because it is satisfiable by the
following database: D = {(1, {a, b, c, d, e}), (2, {a, b, c, d, e}),
(3, {a, b, c, d, e}), (4, {a, b, c, d, f}), (5, {a, b, c, e, f}),
(6, {a, b, d, e, f}), (7, {a, c, d, e, f}), (8, {b, c, d, e, f})}

— the adoption of more expressive constraints besides (anti-)monotonicity,
including extensions based on the inclusion-exclusion principle [13] and other
frequency-based relations [7,35];

— the finding of patterns in continuous data (as, for instance, required in many 
bioinformatic datasets) , which may require discretization techniques beyond 
support- vectors, rough set and Boolean reasoning theories [32] ; 

— the mining of frequent itemsets in structured data such as sequences, trees
and graphs. New formulations are required to represent these problems under
CP, which may not be trivial to encode when resorting to a fixed number of
features or variables;

— the assessment of SAT approaches to perform constraint-based clustering 
and constraint-based classifier induction (not necessarily relying on frequent 
itemsets) . In constraint-based clustering the challenge is to cluster examples 
when additional knowledge is available about these examples, for instance, 
prohibiting certain examples from being clustered together (so-called cannot- 
link constraints) [30,34]. Similarly, in constraint-based classifier induction, 
one may wish to find a decision tree that satisfies size and cost-constraints 
[8] . In traditional data mining, the relationship between itemset mining and 
constraint-based decision tree learning was studied in [33], however such 
relation was not yet exploited in a CP setting; 

— the adoption of SAT verification (as the previously introduced FREQSAT
problem [14]) using deduction rules to prune/constrain the search of frequent
itemsets. The monotonicity rule is a very simple example of deduction. More
advanced rules use the partial frequency information available for some
itemsets to bound the frequencies of itemsets yet to be counted. Examples of
deduction rules to improve pruning and speed up FIM approaches are given in
[6, 16];

— the study of potential techniques to turn SAT and other CP-based ap- 
proaches scalable. Options may include the local scanning on dataset parti- 
tions [32] or the use of data stream mining approaches [15]. 
A Positive Literals

Some solvers only admit a simplified and constrained pseudo-Boolean notation as
input. An example is the exclusion of negated literals, which may not be trivial
to handle. The translation of coverage constraints into constraints with
non-negated literals is trivial:

$$\bigwedge_{t \in T} \Big( \bigwedge_{i \in I \mid D_{ti}=0} (\neg T_t \vee \neg I_i) \;\wedge\; \big(T_t \vee \bigvee_{i \in I \mid D_{ti}=0} I_i\big) \Big)$$

$$\Leftrightarrow \bigwedge_{t \in T} \Big( \bigwedge_{i \in I \mid D_{ti}=0} (T_t + I_i \leq 1) \;\wedge\; \big(T_t + \sum_{i \in I \mid D_{ti}=0} I_i \geq 1\big) \Big)$$

The frequency constraints were translated with the goal of maintaining the
advantageous $\geq$ operator:

$$\bigwedge_{i \in I} \Big( \theta\, \neg I_i + \sum_{t \in T \mid D_{ti}=1} T_t \geq \theta \Big)$$

$$\Leftrightarrow \bigwedge_{i \in I} \Big( -\theta\, I_i + \sum_{t \in T \mid D_{ti}=1} T_t \geq 0 \Big)$$

In this fashion, solvers such as MiniSat+ [19] can be tested without significant
overhead, as most of them are internally able to clausify both of the
pseudo-Boolean constraints defined for the coverage restrictions.
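
The equivalences used in this appendix can be checked exhaustively on small instances. The sketch below is a sanity check only, under the reading that a negated literal ¬I stands for 1 − I in the pseudo-Boolean sum; the function names are illustrative.

```python
from itertools import product


def coverage_clause(t, i):
    """Clausal form with negated literals: (not T_t) or (not I_i)."""
    return (not t) or (not i)


def coverage_positive(t, i):
    """Pseudo-Boolean form with positive literals only: T_t + I_i <= 1."""
    return t + i <= 1


def freq_with_negation(theta, i, ts):
    """theta * (not I_i) + sum of T_t >= theta, with (not I_i) read as 1 - I_i."""
    return theta * (1 - i) + sum(ts) >= theta


def freq_positive(theta, i, ts):
    """Positive-literal form: -theta * I_i + sum of T_t >= 0."""
    return -theta * i + sum(ts) >= 0


# Exhaustive check over all 0/1 assignments for a few thresholds.
for t, i in product([0, 1], repeat=2):
    assert coverage_clause(t, i) == coverage_positive(t, i)

for theta in range(0, 5):
    for i in (0, 1):
        for ts in product([0, 1], repeat=4):
            assert freq_with_negation(theta, i, ts) == freq_positive(theta, i, ts)
```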



B Results 

The following appendix sections detail the observations made in Table 3. The
adopted datasets for these experiments are either real UCI datasets (distributed
in Fig. 3 according to their properties) or generated datasets (with a
customized number of items, number of transactions and density), whose
generation depends on a biasing parameter γ for the emergence of patterns
according to a distribution similar to that of real datasets. The small but
representative zoo dataset is often used to compare options.

Fig. 3: Characterization of the target datasets (high density in black, low in gray)
(plot of number of transactions against number of items)



B.1 Enumeration options

As expected, the adoption of the monotonicity principle within each enumeration
has significantly impacted the performance of SAT solvers. According to the
extract of the results in Table 6, two main observations can be drawn. First,
the level of impact of (either simply or expressively) negating subsets depends
on the search option and on the target frequency.

Understandably, such impact depends on the ability to discover the largest
itemsets early. This is the reason why subsets negation is key to LSM and CMG
searches (their development relies on this principle), and important, but not so
significant, within simple search methods. Additionally, cutting the search
space through subsets negation is more critical for low frequencies, as an
increasing length and number of itemsets is observed.

Second, the choice of when to explicitly insert a negation of each subset or to
expressively insert one unique clause requiring the selection of an item not
observed in the found itemset is not simple. This choice depends on two factors:
the total number of similar itemsets and the relative length of the found
itemset. In the first case, when we have multiple similar frequent itemsets
(sharing the majority of items), the explicit insertion of repeated negated
sub-itemsets is detected by the solver, and the repeated clauses are removed,
while in the expressive insertion every new clause is inserted as-is as a new
problem constraint. In the second case, when the length of itemsets is small in
comparison with the item alphabet, the negated sub-itemsets inserted by an
explicit negation strategy generate multiple clauses, but the number of clauses
is not high and the number of literals per clause is low. In contrast, although
expressive negation strategies only add one clause per iteration, the number of
literals can be significantly high and may hamper the resolution performance.
These observations call for increased attention to the strategy selection based
on the input frequency and dataset properties.

Search                Enumeration          θ=0,2    θ=0,4   θ=0,6
Optimization (SAT4J)  Simple               1514,0   230,0   1,0
                      Expressive Simple     943,7    89,2   0,1
                      Subsets Negation       31,3     2,6   0,1
                      Expr. Subsets Neg.     18,2     1,8   0,1
Decision (SAT4J)      Simple               1009,4    12,6   0,5
                      Expressive Simple     589,8     5,4   0,2
                      Subsets Negation      208,8     4,8   0,1
                      Expr. Subsets Neg.    183,0     6,5   0,2

Table 6: Comparison of enumeration strategies for the zoo dataset (seconds)
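
The expressive subsets negation can be sketched as follows: after an itemset is found, one clause over the item variables requires that at least one item outside it is selected, which excludes the itemset and all of its subsets at once. Items are assumed to be numbered from 0 and mapped to the positive literals 1..n; the function names are hypothetical.

```python
from itertools import combinations


def expressive_subsets_negation(found, n_items):
    """One clause: select at least one item that is not in the found itemset."""
    return [i + 1 for i in range(n_items) if i not in found]


def clause_holds(clause, itemset):
    """True if the itemset selects at least one literal of the clause."""
    return any((lit - 1) in itemset for lit in clause)


# Toy check (hypothetical): alphabet of 4 items, found itemset {0, 2}.
found, n = {0, 2}, 4
clause = expressive_subsets_negation(found, n)
print(clause)  # [2, 4]

# The clause rejects exactly the subsets of the found itemset.
for k in range(n + 1):
    for combo in combinations(range(n), k):
        s = set(combo)
        assert clause_holds(clause, s) == (not s <= found)
```

The clause has n minus the length of the found itemset literals, which is why short itemsets over a large alphabet lead to long clauses under this strategy.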

B.2 Search options 

The first observation coming from the search option results is that, although
CMG search is overall the best choice, the performance of the searches varies
considerably according to the dataset density and input frequency.
Maximal-oriented searches (such as LSM, CMG and LD) perform better for low
frequency thresholds in dense datasets and for high frequency thresholds in
sparse datasets. Table 7 performs a two-axis evaluation: over generated datasets
with varying densities and over a fixed dataset with varying frequency
thresholds.



Search    Dense Gen.   Normal Gen.   Sparse Gen.
Simple    -            -             -
LSM       3491,4       176,2         75,4
CMG       268,6        97,5          62,9
LD        1296,1       -             291,2

Search    θ=0,02   θ=0,05   θ=0,1   θ=0,2   θ=0,4
Simple    1020,1   978,9    792,3   183,0   6,5
LSM       5,9      14,1     26,8    17,0    1,8
CMG       4,9      7,4      9,6     7,6     1,0
LD        5,4      9,8      9,5     7,4     3,4

Table 7: Comparison of search options on generated datasets with different densities
(θ=0,05) and on zoo dataset with different frequencies using SAT4J (seconds)
Search   Measure                               θ=0,02   θ=0,05   θ=0,1   θ=0,2   θ=0,4
Simple   Number of searches                    62032    30369    9354    4087    367
         Average time per search               2,3      2,5      3,7     7,6     49
LSM      Number of searches                    119      221      298     267     90
         Average time per search               8        23       43      78      195
CMG      Number of α-searches                  119      221      298     267     90
         Average time per main search          2,2      2,2      2,8     4,3     38
         Number of β-searches                  244      783      1075    622     134
         Average time per constrained search   1,02     1,02     1,11    1,22    9,1
LD       Number of unsat searches              32       32       32      32      32
         Average time per main search          7,2      8,4      15,3    27,4    82
         Number of sat searches                119      221      298     267     90
         Average time per constrained search   2,7      3,2      3,9     6,3     28,3

Table 8: Search options in MiniSat+: number of searches and avg. time (milliseconds)

Simple search methods are only able to perform searches with acceptable
efficiency on high frequency thresholds over very sparse datasets. In fact, even
when adopting the monotonicity principle, simple searches do not scale, as is
visible from the increasing number of iterations (Table 8) as frequency
thresholds decrease. Simple searches should only be adopted for an
implementation that is able to promote the early selection of larger itemsets
by, for instance, adjusting the polarity of the item variables. If maximal
frequent itemsets are guaranteed in this way, simple search should be able to
outperform constrained CMG.

LSM searches are only competitive when a small number of iterations is
performed, since each search is a lot heavier than a full α-search (including
multiple β-searches), as it has a larger space to exploit. This is usually the
case where multiple similar itemsets (e.g. ACD, ACE, ADE, BCD, BDE) collapse
into unique maximal frequent itemsets (e.g. ABCDE) as a result of the frequency
threshold decreasing.

CMG searches smooth the heavy computational cost of discovering multiple similar
itemsets, as each β-search is very light, as depicted in Table 8. The average
time per β-search is less than half that of an α-search (and this ratio
decreases even further for larger datasets).

Finally, LD is as competitive as CMG for medium frequency thresholds (0.05 <
θ < 0.2) on small datasets (LD performance quickly degrades with their growing
size). Instead of performing α- and β-searches, it performs one unique type of
search for a fixed length of itemsets, which is a very focused search. In order
to cover all possible lengths, n of these searches are unsat. The number of sat
searches is equal to the number of maximal frequent itemsets. A search returning
sat has a performance similar to that of an α-search, the additional overhead
being added by a fixed number of unsat searches (understandably, not varying
with the input frequency). This overhead can be critical, as a search returning
unsat is considerably heavier than a search returning sat or a β-search (see
Table 8). Therefore, the adoption of this search option essentially depends on
how the number of items (defining the number of unsat searches in LD) compares
to the number of maximal frequent itemsets (influencing the number of β-searches
in CMG).
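
The trade-off between the search options can be made tangible by multiplying the number of searches by the average time per search. The sketch below does this for the θ=0,02 column of Table 8 (MiniSat+, milliseconds); it is a rough estimate that ignores any overhead outside the searches themselves, such as parsing or clause insertion.

```python
# (number of searches, average time per search in ms) at theta = 0.02, Table 8.
TABLE8_THETA_002 = {
    "Simple": [(62032, 2.3)],
    "LSM": [(119, 8.0)],
    "CMG": [(119, 2.2), (244, 1.02)],   # alpha-searches, beta-searches
    "LD": [(32, 7.2), (119, 2.7)],      # unsat searches, sat searches
}


def total_ms(phases):
    """Estimated total search time: sum of count x average time per phase."""
    return sum(n * avg for n, avg in phases)


estimates = {name: total_ms(p) for name, p in TABLE8_THETA_002.items()}
ranking = sorted(estimates, key=estimates.get)
for name in ranking:
    print(f"{name}: {estimates[name]:.1f} ms")
```

Under this estimate CMG comes first, LD is close behind (its 32 unsat searches account for a large share of its cost), LSM pays for its heavier searches, and simple search is dominated by its number of iterations.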

B.3 Encoding options 

Two main observations can be derived from the experimental tests over different
encodings (see Table 9). First, a clause-oriented encoding seems to be preferred
over a constrained encoding. Among other aspects, the adopted constrained
encoding requires the use of non-negated variables and additional clauses and
variables to support later clause removals. This is particularly true for SAT4J,
as this solver is not able to clausify some of these constraints.



Encoding                  θ=0,02   θ=0,05   θ=0,1   θ=0,2   θ=0,4   θ=0,6
Clause-oriented           4,3      7,9      9,1     7,1     1,0     0,3
Alternative               4,8      10,9     29,0    7,0     1,0     0,2
Constrained               5,3      8,8      46,8    8,4     2,0     0,4
Alternative Constrained   5,9      12,9     60,7    8,3     1,3     0,2

Table 9: Comparison of encodings for the zoo dataset using SAT4J (seconds)



Second, alternative encodings that aim to reduce the number of clauses from O(mn) to 
O(m + n) (see Section 3.5) may not result in significant improvements, since the solver, 
instead of dealing with nm simple binary clauses, has to deal with m complex constraints. 
The choice of whether to adopt this encoding mainly depends on the input frequency 
threshold (adopt it for θ > 20% and avoid it under low frequency thresholds). 







B.4 Tuning options 

As depicted in Fig. 4, two simple variable polarity suggestions were tested. The 
positive suggestion is the better choice for low frequency thresholds: since item 
variables are initially set to true, larger itemsets tend to be identified first. 

[Plot: time versus frequency threshold for the positive and negative polarity suggestions.] 

Fig. 4: Positive vs. negative polarity suggestion for the zoo.txt dataset (milliseconds) 

However, the negative suggestion becomes the better option for frequencies above 20% 
in many datasets (~30% in Fig. 2). This is because, under the positive suggestion, many 
of the large itemsets tried initially do not satisfy these higher thresholds, so 
significantly more conflicts are found within each iteration, leading to additional 
inefficiency from the increased number of backtracks. 

Advanced polarity suggestions should not rely only on a single overall suggestion for 
all variables, but should also be able to: i) differentiate the polarity suggestion 
between item variables and transaction variables, and ii) locally adapt suggestions 
based on task-driven heuristics (e.g., a low number of occurrences of an item variable 
relative to others may result in a negative polarity suggestion). 
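
One possible reading of such a task-driven heuristic is sketched below. It is only an illustration under stated assumptions: the class PolaritySuggestion, the switchPoint parameter (set to 0.2 here, echoing the roughly 20% threshold discussed above) and the ratio parameter are hypothetical, and the sketch only computes the suggestions; how they are passed to a particular solver is left out.

```java
import java.util.Arrays;

/** Hypothetical sketch of a global plus per-item polarity suggestion. */
final class PolaritySuggestion {

    /** Global default: positive polarity below the switch point, negative otherwise. */
    static boolean globalPolarity(double theta, double switchPoint) {
        return theta < switchPoint;
    }

    /** Number of transactions (rows) in which each item (column) occurs. */
    static int[] itemCounts(boolean[][] data) {
        int n = data.length == 0 ? 0 : data[0].length;
        int[] counts = new int[n];
        for (boolean[] row : data) {
            for (int i = 0; i < n; i++) {
                if (row[i]) counts[i]++;
            }
        }
        return counts;
    }

    /**
     * Per-item suggestion: start from the global default and switch to negative
     * polarity for items occurring less often than ratio * (mean item count).
     */
    static boolean[] itemPolarities(boolean[][] data, double theta,
                                    double switchPoint, double ratio) {
        int[] counts = itemCounts(data);
        double mean = Arrays.stream(counts).average().orElse(0.0);
        boolean global = globalPolarity(theta, switchPoint);
        boolean[] polarity = new boolean[counts.length];
        for (int i = 0; i < counts.length; i++) {
            polarity[i] = global && counts[i] >= ratio * mean;
        }
        return polarity;
    }

    public static void main(String[] args) {
        boolean[][] data = {
            {true, true, false},
            {true, false, false},
            {true, true, false},
            {true, true, true}
        };
        // Item counts are 4, 3 and 1; the rare third item gets negative polarity.
        System.out.println(Arrays.toString(itemPolarities(data, 0.05, 0.2, 0.5)));
        // prints [true, true, false]
    }
}
```

Transaction variables could be given their own default in the same way, which would cover point i) above.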



B.5 Implementation options 



Interestingly, Fig. 5 shows that each adopted SAT solver behaves differently when 
solving the FIM problem. SAT4J is the best option when targeting high frequency 
thresholds and when mining very sparse datasets. Beyond its resolution specificities, 
this is also a result of its support for non-constrained encodings (including, among 
others, negated variables, differentiated insertion of clauses and pseudo-Boolean 
constraints, and clause removal). SAT4J's main problems are its memory inefficiency 
when dealing with large datasets and the fact that its algorithm is more tuned to 
find smaller itemsets, which hampers the behavior of CMG searches since it 
exponentially increases the number of β-searches. 



[Plot: time (seconds) versus frequency threshold for MiniSat+ and SAT4J.] 

Fig. 5: Implementation options for the zoo dataset under the best method 



MiniSat+ is the natural choice for the remaining cases: dense datasets and low 
frequency thresholds. This is not only a consequence of its resolution methods or of 
the additional efficiency of C++, but also a result of multiple improvements related to 
the solver parameterization, with the positive polarity suggestion being the most 
significant. 

The discussion of the behavior of the other adopted solvers, such as PBS [2], BSOLO [28] 
and WBO [27], is out of scope because of the excessive latency caused by the need to 
call them as executables. The successive memory refreshes between iterations and the 
recurrent need to parse and clausify formulas distort any potential analysis. 



B.6 Fixing phase transitions 




(a) Phase transitions for varying #items    (b) Phase transitions for varying #transactions 

Fig. 6: Phase transitions for varying size of generated datasets (using MiniSat+) 

The study of phase transitions on generated datasets with varying density (Fig. 1), 
number of items (Fig. 6a) and number of transactions (Fig. 6b) led to two main 
observations. First, density is the dataset property with the highest impact on the 
performance of the proposed solution: a variation of a few percentage points can 
degrade performance exponentially. Second, and understandably, the size of the 
dataset is also a decisive criterion affecting the behavior of our solution. A variation 
in the number of items is more critical than one in the number of transactions. Although 
our solution scales poorly beyond 100 items, for low frequencies in dense datasets or 
high frequencies in sparse datasets it can handle up to 10 000 transactions. 
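
For readers who want to reproduce this kind of experiment, the sketch below generates a random Boolean dataset with a target density and measures the density actually obtained. It assumes that density means the fraction of true entries in the transaction-by-item matrix (the usual definition, not restated in this section) and that cells are drawn independently. The class DensitySketch is hypothetical and is unrelated to the DatasetGeneration class of the released software.

```java
import java.util.Random;

/** Hypothetical sketch: random Boolean datasets with a target density. */
final class DensitySketch {

    /** Builds a transactions x items matrix where each cell is true with probability density. */
    static boolean[][] generate(int transactions, int items, double density, long seed) {
        Random rng = new Random(seed);
        boolean[][] data = new boolean[transactions][items];
        for (boolean[] row : data) {
            for (int i = 0; i < items; i++) {
                row[i] = rng.nextDouble() < density;
            }
        }
        return data;
    }

    /** Fraction of true entries in the matrix (0.0 for an empty matrix). */
    static double density(boolean[][] data) {
        long ones = 0;
        long cells = 0;
        for (boolean[] row : data) {
            for (boolean value : row) {
                if (value) ones++;
            }
            cells += row.length;
        }
        return cells == 0 ? 0.0 : (double) ones / cells;
    }

    public static void main(String[] args) {
        // Vary one dimension at a time, as in Fig. 6, while keeping the density fixed.
        for (int items : new int[]{25, 50, 100}) {
            boolean[][] data = generate(1000, items, 0.3, 42L);
            System.out.printf("items=%d measured density=%.3f%n", items, density(data));
        }
    }
}
```

Fixing the seed keeps generated datasets reproducible across solver runs, which matters when small density changes have a large effect on running time.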



C Software Capabilities 

— flexible selection and combination of the search options, enumeration strategies, 
target datasets, frequencies of interest and optimization parameters (a fragment 
of the testing code in Java is depicted below); 

— extensible codification that gives a basis to model real problems using simple and 
enumeration-centered SAT or PB: 

    — the Utils package contains a general set of encoding functionalities, such as the 
    generation of .opb and .cnf files adapted to the restrictions of a particular solver; 

    — the Solver package provides the interface to SAT solvers (supporting both a direct 
    interface through method invocation and calls via executables) and the ability to 
    select multiple search and enumeration options in a task-independent manner; 

— parameterizable generation of datasets and their straightforward use to test 
performance limits; 

— flexible addition of pattern mining constraints (by extending the SatPM class). 

    List<Dataset> datasets = DatasetGeneration.getDatasets();
    List<SATHandler> handlers = SolverOptions.getSolvers();
    List<Strategy> strategies = StrategyOptions.getStrategies();

    for (Dataset dataset : datasets) {
        dataset.encodingOption(encodingID);      // select the encoding (B.3)
        for (SATHandler handler : handlers) {
            handler.setPolarity(polarityID);     // select the polarity suggestion (B.4)
            for (Strategy strategy : strategies) {
                // Sweep the frequency threshold; as printed, the fragment does
                // not show how freq is handed to StandardFIM.
                for (double freq = 0.01; freq < 0.8; freq += 0.01) {
                    results.add(new StandardFIM(dataset, handler, strategy).run());
                }
            }
        }
    }

The software is available at web.ist.utl.pt/rmch/research/software. 



